Clustering and Outlier Detection


A Practical Algorithm for Distributed Clustering and Outlier Detection

Jiecao Chen, Erfan Sadeqi Azer, Qin Zhang

Neural Information Processing Systems

We study classic k-means/median clustering, which are fundamental problems in unsupervised learning, in the setting where data are partitioned across multiple sites, and where we are allowed to discard a small portion of the data by labeling those points as outliers. We propose a simple approach based on constructing a small summary of the original dataset. The proposed method is time- and communication-efficient, has good approximation guarantees, and can identify the global outliers effectively. To the best of our knowledge, this is the first practical algorithm with theoretical guarantees for distributed clustering with outliers. Our experiments on both real and synthetic data demonstrate the clear superiority of our algorithm over all the baseline algorithms in almost all metrics.
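The summary-based idea can be illustrated with a toy sketch. This is not the paper's exact algorithm: the summary construction (plain weighted k-means per site), the deterministic heaviest-point initialization, and the "t globally farthest points" outlier rule are all simplifying assumptions made here for brevity.

```python
import math

def kmeans(points, weights, k, iters=25):
    """Weighted Lloyd's k-means, deterministically seeded with the k
    heaviest points (a simplification; real code would use k-means++)."""
    order = sorted(range(len(points)), key=lambda i: -weights[i])
    centers = [points[i] for i in order[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[j].append((p, w))
        for j, g in enumerate(groups):
            tot = sum(w for _, w in g)
            if tot > 0:  # leave centers of empty/zero-weight groups untouched
                dim = len(centers[j])
                centers[j] = tuple(sum(p[d] * w for p, w in g) / tot
                                   for d in range(dim))
    return centers

def site_summary(points, size):
    """A site compresses its local data into `size` weighted centers."""
    centers = kmeans(points, [1.0] * len(points), size)
    weights = [0.0] * size
    for p in points:
        weights[min(range(size), key=lambda i: math.dist(p, centers[i]))] += 1.0
    return centers, weights

def distributed_cluster_with_outliers(sites, k, t, summary_size=8):
    """Coordinator: cluster the union of the site summaries, then report
    the t original points farthest from the final centers as outliers."""
    all_centers, all_weights = [], []
    for pts in sites:
        c, w = site_summary(pts, summary_size)
        all_centers += c
        all_weights += w
    centers = kmeans(all_centers, all_weights, k)
    everything = [p for pts in sites for p in pts]
    ranked = sorted(everything,
                    key=lambda p: min(math.dist(p, c) for c in centers))
    return centers, ranked[len(everything) - t:]
```

Only the small summaries travel to the coordinator, which is the source of the communication savings; the paper's actual summary comes with approximation guarantees that this naive version does not have.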


On Integrated Clustering and Outlier Detection

Lionel Ott, Linsey Pang, Fabio T. Ramos, Sanjay Chawla

Neural Information Processing Systems

We model the joint clustering and outlier detection problem using an extension of the facility location formulation. The advantages of combining clustering and outlier selection include: (i) the resulting clusters tend to be compact and semantically coherent; (ii) the clusters are more robust against data perturbations; and (iii) the outliers are contextualised by the clusters and thus more interpretable. We provide a practical subgradient-based algorithm for the problem and also study the theoretical properties of the algorithm in terms of approximation and convergence. Extensive evaluation on synthetic and real data sets attests to both the quality and scalability of our proposed method.
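A facility-location view of joint clustering and outlier detection can be made concrete on a tiny instance. The specifics below (a uniform opening cost `f_cost` and a fixed outlier budget `t`) are assumptions for illustration; the paper's exact objective and its Lagrangian subgradient solver are not reproduced here, and the exhaustive search is only viable for toy inputs.

```python
import math
from itertools import combinations

def cost_with_outliers(points, open_facilities, f_cost, t):
    """Objective of a common facility-location-with-outliers variant:
    opening costs plus assignment distances, with the t worst-served
    points discarded as outliers."""
    d = sorted(min(math.dist(p, f) for f in open_facilities) for p in points)
    return f_cost * len(open_facilities) + sum(d[:len(points) - t])

def best_facilities(points, candidates, f_cost, t):
    """Brute-force search over candidate facility subsets; a stand-in for
    the scalable subgradient method the paper develops."""
    best_cost, best_set = float("inf"), None
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            c = cost_with_outliers(points, subset, f_cost, t)
            if c < best_cost:
                best_cost, best_set = c, subset
    return best_cost, best_set
```

On two tight clusters plus one distant point, the optimum opens one facility per cluster and spends the outlier budget on the distant point, which is exactly the "outliers are contextualised by the clusters" behaviour described above.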


Reviews: A Practical Algorithm for Distributed Clustering and Outlier Detection

Neural Information Processing Systems

The paper addresses the problem of performing distributed k-means/median clustering in the presence of outliers, while at the same time identifying the outliers. Data are partitioned across multiple sites, either adversarially or randomly, and the sites and a central coordinator jointly compute the clustering and the outliers through communication. The authors proposed a practical algorithm with bounded running time O(\max\{k, \log n\} \cdot n), and bounded communication cost O(s(k \log n + t)) and O(s k \log n + t) for adversarial and random data partitioning respectively, for a dataset with n data points, k centers, and t outliers, partitioned across s sites. They used a traditional two-level clustering framework (Guha et al. 2017). If a \gamma-approximation algorithm for (k,t)-means/median is used as the second-level clustering algorithm, their distributed algorithm has a bounded O(\gamma) approximation factor. Extensive experimental studies were conducted to compare the performance of their algorithm with three baseline algorithms.
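The second level of such a two-level framework needs some (k,t)-means routine that clusters while setting t points aside. A simple heuristic in that family is a trimmed variant of Lloyd's algorithm in the spirit of k-means-- (Chawla and Gionis, 2013); the sketch below is that heuristic with a naive deterministic initialization, not the specific second-level algorithm evaluated in the reviewed paper.

```python
import math

def k_means_with_outliers(points, k, t, iters=30):
    """Trimmed Lloyd iteration: in each step the t points farthest from
    their nearest center are set aside, and the centers are recomputed
    from the remaining inliers only."""
    centers = list(points[:k])  # naive deterministic init for brevity
    for _ in range(iters):
        ranked = sorted(points,
                        key=lambda p: min(math.dist(p, c) for c in centers))
        inliers = ranked[:len(points) - t]
        groups = [[] for _ in range(k)]
        for p in inliers:
            groups[min(range(k),
                       key=lambda i: math.dist(p, centers[i]))].append(p)
        for j, g in enumerate(groups):
            if g:
                dim = len(centers[j])
                centers[j] = tuple(sum(p[d] for p in g) / len(g)
                                   for d in range(dim))
    ranked = sorted(points,
                    key=lambda p: min(math.dist(p, c) for c in centers))
    return centers, ranked[len(points) - t:]
```

Excluding the t farthest points before each center update is what keeps a single extreme point from dragging a center away from its cluster, which is the failure mode of plain Lloyd's algorithm in the presence of outliers.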


